{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word2vec实战"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入gensim中的word2vec模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import word2vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用生成器的方式读取文件里的句子，适合读取大容量文件，而不用加载到内存"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "code_folding": []
   },
   "outputs": [],
   "source": [
    "class MySentences(object):\n",
    "    def __init__(self, fname):\n",
    "        self.fname = fname\n",
    " \n",
    "    def __iter__(self):\n",
    "        for line in open(self.fname, 'r'):\n",
    "            yield line.split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义一个训练函数，指定输入的文件路径和待输出的模型路径"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def W2vTrain(flie_input, model_output):         \n",
    "    sentences = MySentences(DataDir+file_input)\n",
    "    w2v_model = word2vec.Word2Vec(sentences, \n",
    "                                  min_count = MIN_COUNT, \n",
    "                                  workers = CPU_NUM, \n",
    "                                  size = VEC_SIZE,\n",
    "                                  window = CONTEXT_WINDOW\n",
    "                                 )\n",
    "    w2v_model.save(ModelDir+model_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面`word2vec.Word2Vec()`便是Word2vec在gensim中的实现，一些参数解释如下：\n",
    "\n",
    "- min_count: 对于词频 < min_count 的单词，将舍弃（其实最合适的方法是用 UNK 符号代替，即所谓的『未登录词』，这里简化起见，认为此类低频词不重要，直接抛弃）\n",
    "- workers: 可以并行执行的核心数，需要安装 Cython 才能起作用（安装 Cython 的方法很简单，直接 pip install cython）\n",
    "- size: 词向量的维度，即Skip-gram或CBOW模型隐藏层的节点数\n",
    "- window: 目标词汇的上下文单词距目标词的最长距离，很好理解，比如 CBOW 模型是用一个词的上下文预测这个词，那这个上下文总得有个限制，如果取得太多，距离目标词太远，有些词就没啥意义了，而如果取得太少，又信息不足，所以 window 就是上下文的一个最长距离"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "开始训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "DataDir = \"./\"\n",
    "ModelDir = \"./ipynb_garbage_files/\"\n",
    "MIN_COUNT = 5\n",
    "CPU_NUM = 2 # 需要预先安装 Cython 以支持并行\n",
    "VEC_SIZE = 20\n",
    "CONTEXT_WINDOW = 5 # 提取目标词上下文距离最长5个词\n",
    "\n",
    "file_input = \"bioCorpus_5000.txt\"\n",
    "model_output = \"test_w2v_model\"\n",
    "\n",
    "W2vTrain(file_input, model_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "加载模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:4: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n",
      "  after removing the cwd from sys.path.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('influence', 0.9991704821586609),\n",
       " ('in', 0.9991064667701721),\n",
       " ('membrane', 0.9990575909614563),\n",
       " ('beta', 0.9990391731262207),\n",
       " ('isolated', 0.9990122318267822),\n",
       " ('its', 0.9989871382713318),\n",
       " ('distribution', 0.9989349842071533),\n",
       " ('two', 0.9989125728607178),\n",
       " ('proceedings', 0.9989111423492432),\n",
       " ('binding', 0.9988784790039062)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "w2v_model = word2vec.Word2Vec.load(ModelDir+model_output)\n",
    "\n",
    "# 查找body这个词的近义词\n",
    "w2v_model.most_similar('body')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "左边一列是相似词，右边是相似度，可以看到，除了blood, cardiac, plasma这些比较make sense的词外，混入了with，to，after这些明显不make sense的词，在 NLP 里面我们叫『停用词』，也就是像常见代词、介词之类，造成这种结果的原因有二：\n",
    "\n",
    "- 参数设置不佳，比如`vec_size`设置的太小，导致这20个维度不足以capture单词间不同的信息，所以需要继续调整超参数\n",
    "- 数据集较小，因此停止词占据了太多信息量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入NLTK包，去除停用词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.corpus import stopwords\n",
    "StopWords = stopwords.words('english')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来需要在读取数据集每句话之后，对该句话进行处理，再扔进word2vec，所以需要定义一个新的训练函数，再原来基础上加上去除停用词的操作，并重新训练模型（为了方便对比，参数设置跟上面一样保持不变）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "code_folding": []
   },
   "outputs": [],
   "source": [
    "# 重新训练，定义新的模型训练函数\n",
    "def W2vTrain_removeStopWords(file_input, model_output):         \n",
    "    sentences = list(MySentences(DataDir+file_input))\n",
    "    for idx,sentence in enumerate(sentences):\n",
    "        sentence = [w for w in sentence if w not in StopWords]\n",
    "        sentences[idx]=sentence\n",
    "    w2v_model = word2vec.Word2Vec(sentences, min_count = MIN_COUNT, \n",
    "                                  workers = CPU_NUM, size = VEC_SIZE)\n",
    "    w2v_model.save(ModelDir+model_output)\n",
    "\n",
    "W2vTrain_removeStopWords(file_input, model_output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n",
      "  \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('influence', 0.9866985082626343),\n",
       " ('distribution', 0.9854052066802979),\n",
       " ('membrane', 0.9842897057533264),\n",
       " ('two', 0.9840079545974731),\n",
       " ('isolated', 0.982103705406189),\n",
       " ('effects', 0.9817605018615723),\n",
       " ('muscle', 0.9813024401664734),\n",
       " ('drugs', 0.9807368516921997),\n",
       " ('beta', 0.9807223677635193),\n",
       " ('binding', 0.9806714057922363)]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "w2v_model = word2vec.Word2Vec.load(ModelDir+model_output)\n",
    "w2v_model.most_similar('body')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "观察两次的结果，发现停用词的情况有所改善，但是效果依然不明显，可能原因有二，一是一些超参数需要细致的调参，二是语料库规模太小。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
